Web Directories as Training Data for Automated Metadata Extraction
Abstract
Although man-made annotations are considered the main ‘knowledge fuel’ for the Semantic Web, the majority of existing commercial pages are still poorly equipped with any kind of metadata, never mind forthcoming standards such as the RDF syntax or the Dublin Core semantics. Information Extraction, which relies on characteristic patterns in text, can be applied even to such ‘legacy’ pages in order to obtain metadata containing, for example, the names, types, and domains of activity of the WWW subjects (companies).
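As a rough illustration of what pattern-based extraction from a ‘legacy’ page might look like, the following minimal sketch pulls a company name, legal type, and domain of activity out of plain text with regular expressions. It is not the method of the paper: the function name `extract_metadata`, the list of legal-form suffixes, and the activity cue phrases are all assumptions made for this example.

```python
import re

# Hypothetical pattern: one to three capitalized words followed by a
# legal-form suffix. The suffix list is an illustrative assumption.
COMPANY_RE = re.compile(
    r"\b((?:[A-Z][\w&-]*\s+){1,3})(Ltd|Inc|GmbH|LLC)\b"
)

# Hypothetical cue phrases that introduce a domain of activity.
ACTIVITY_RE = re.compile(
    r"\b(?:specializes in|is a (?:manufacturer|supplier|provider) of)\s+([^.,;]+)"
)


def extract_metadata(text):
    """Return name, type and activity found in text (None when absent)."""
    meta = {"name": None, "type": None, "activity": None}
    company = COMPANY_RE.search(text)
    if company:
        meta["name"] = company.group(1).strip()
        meta["type"] = company.group(2)
    activity = ACTIVITY_RE.search(text)
    if activity:
        meta["activity"] = activity.group(1).strip()
    return meta


if __name__ == "__main__":
    page_text = (
        "Welcome to Acme Tools Ltd. "
        "Our company specializes in hand tools for woodworking."
    )
    print(extract_metadata(page_text))
```

Hand-written patterns like these are brittle (any capitalized word directly before the name would be absorbed into it), which is one reason to learn such patterns from labelled examples instead.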
Similar resources
On the Automated Classification of Web Sites
In this paper we discuss several issues related to automated text classification of web sites. We analyze the nature of web content and metadata in relation to requirements for text features. We find that HTML metatags are a good source of text features, but are not in wide use despite their role in search engine rankings. We present an approach for targeted spidering including metadata extract...
Practical Issues for Automated Categorization of Web Sites
In this paper we discuss several issues related to automated text classification of web sites. We analyze the nature of web content and metadata and requirements for text features. We present an approach for targeted spidering including metadata extraction and opportunistic crawling of specific semantic hyperlinks. We describe a system for automatically classifying web sites into industry categ...
Ontea: Platform for Pattern Based Automated Semantic Annotation
Automated annotation of web documents is a key challenge of the Semantic Web effort. Semantic metadata can be created manually or using automated annotation or tagging tools. The automated semantic annotation tools with the best results are built on various machine learning algorithms, which require training sets. Another approach is to use pattern-based semantic annotation solutions built on natural lang...
Towards Large Scale Semantic Annotation Built on MapReduce Architecture
Automated annotation of web documents is a key challenge of the Semantic Web effort. Web documents are structured, but their structure is understandable only to a human, which is the major problem of the Semantic Web. The Semantic Web can be exploited only if metadata understood by a computer reach critical mass. Semantic metadata can be created manually, using automated annotation or tagging too...
Automatic Generation of RDF Metadata final version
The Resource Description Framework (RDF[9]) has been developed to fulfil the need for a mechanism for resource description within the Web's architecture. With over 320 million[10] individually accessible objects on the Web, the ability to describe each one so that it can be conceptualized without being accessed and analyzed is increasingly important. This paper describes how an automatic classi...